Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the ADVANCED ANALYST - Foundations for Advanced Data Analytics in R DataKolektiv training.
Our task in this session is to learn the basics of the Plotly package, an industry standard in interactive data visualization. We will learn the basics of Plotly by diving even deeper into Exploratory Data Analysis, of course! Interactive data visualizations are especially important today, when data analytics are mostly delivered over the web and rarely through print or slide decks alone. One day, if you decide to continue your journey in R, you might realize just how powerful the combination of Plotly with RStudio Shiny web applications is. We will also start introducing the concept of A/B testing in data analytics in this session, beginning with the basic t-test for differences between the means of two independent groups.
NOTE on the usage of the pipe operator in this session:
In this session, I will use the %>% pipe operator, popular in dplyr and originally developed in the magrittr package, in place of the previously used “new” native R pipe operator |>.
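For simple chains like the ones in this session, the two pipes give the same result. A minimal sketch, assuming R 4.1 or later (where |> is available); the vector x and the result names are illustrative and not part of the course data:

```r
library(magrittr)  # provides %>%; dplyr and the tidyverse re-export it

# Illustrative input, not part of the Boston Housing data
x <- c(1, 4, 9)

# magrittr pipe
res_magrittr <- x %>% sqrt() %>% sum()

# Native R pipe (requires R >= 4.1)
res_native <- x |> sqrt() |> sum()

# Both chains compute sum(sqrt(x))
identical(res_magrittr, res_native)
```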
Boston Housing Data
We will use the (in)famous Boston Housing data set in this session.
# Load the tidyverse package: ggplot2 is a part of it!
library(tidyverse)
# Load plotly: it provides plot_ly() and layout(), used throughout this session
library(plotly)
# The path to your CSV file
data_dir <- paste0(getwd(), "/_data/")
filename <- "BostonHousing.csv"
filepath <- paste0(data_dir, filename)
# Load the data into R
housing <- readr::read_csv(filepath)
Rows: 506 Columns: 14
── Column specification ─────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Glimpse its structure to ensure it has arrived in full
glimpse(housing)
Rows: 506
Columns: 14
$ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124…
$ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 0.0, 0…
$ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 8…
$ chas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524, 0.524, 0.…
$ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004, 6.377, 6.…
$ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 94.3, 82.9, 39.0, …
$ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505, 6.0821, 6.5921,…
$ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 311, 307, 307, 307,…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 2…
$ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90, 386.63, 386.71,…
$ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10, 20.45, 13.27, 1…
$ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 2…
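If the CSV file is not at hand, the same data set ships with the MASS package as MASS::Boston. The sketch below assumes that MASS is installed and that the main difference from the CSV is the name of one column (black there, b here); housing_alt is a name introduced for illustration.

```r
# Access MASS::Boston without attaching MASS, so that
# MASS::select() does not mask dplyr::select()
housing_alt <- MASS::Boston

# Rename 'black' to 'b' to match the column names of the CSV used above
names(housing_alt)[names(housing_alt) == "black"] <- "b"

# Quick structural checks: number of rows, number of columns, column names
dim(housing_alt)
names(housing_alt)
```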
Understanding the Boston Housing data:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
b: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: % lower status of the population
medv: median value of owner-occupied homes in $1000’s
# Generate summary statistics for the dataset
summary(housing)
crim zn indus chas nox
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710
rm age dis rad tax
Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0
1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0
Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0
Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2
3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0
Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0
ptratio b lstat medv
Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00
1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
Median :19.05 Median :391.44 Median :11.36 Median :21.20
Mean :18.46 Mean :356.67 Mean :12.65 Mean :22.53
3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00
EDA Question: How does the number of rooms
(rm) influence the median value of homes
(medv)?
Step-by-step Explanation:
Initialize the Plot: We start by calling plot_ly(), where we specify the dataset and the axes.
plot_ly(data = housing, x = ~rm, y = ~medv)
data = housing tells Plotly to use the Boston Housing dataset;
x = ~rm sets the x-axis to represent the number of rooms;
y = ~medv sets the y-axis to represent the median home value.
This is similar to ggplot2::aes() mapping, right?
Now we want to specify the Plot Type: define the plot type and mode.
This is what we will do:
type = 'scatter', mode = 'markers'
type = 'scatter' creates a scatter plot;
mode = 'markers' uses points to represent data points.
Next, customize Markers: adjust the appearance of the markers.
marker = list(size = 10, color = 'rgba(255, 182, 193, .9)')
size = 10 gives every marker the same size;
color = 'rgba(255, 182, 193, .9)' sets a light pink color with some transparency.
Add text labels: include text labels that appear when hovering over each marker.
text = ~paste("MEDV:", medv, "Rooms:", rm)
text = configures the hover text to show both the median value and number of rooms.
Finally, we configure the Layout: finalize the plot with titles and axis labels.
layout(title = "Scatter Plot of Median Home Value vs. Number of Rooms",
xaxis = list(title = "Number of Rooms"),
yaxis = list(title = "Median Home Value ($1000s)"))
title sets the main title of the plot;
xaxis and yaxis define the titles for the x-axis and y-axis, respectively.
Here is the complete code for our Plotly interactive scatter plot:
plot_ly(data = housing, x = ~rm, y = ~medv, type = 'scatter', mode = 'markers',
marker = list(size = 10, color = 'rgba(255, 182, 193, .9)'),
text = ~paste("MEDV:", medv, "Rooms:", rm)) %>%
layout(title = "Scatter Plot of Median Home Value vs. Number of Rooms",
xaxis = list(title = "Number of Rooms"),
yaxis = list(title = "Median Home Value ($1000s)"))
Analysis. This scatter plot helps us visually assess the relationship between the number of rooms and home values. Generally, we might observe a positive trend where more rooms indicate a higher median home value, suggesting that larger homes in Boston are more expensive.
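To put a number on that trend, one option is to overlay a least-squares line and compute the Pearson correlation. This is a sketch, not part of the original notebook: it uses MASS::Boston as a stand-in for housing so that it runs on its own, and the names housing_demo, fit, and p_fit are illustrative.

```r
library(plotly)

# Stand-in for 'housing' (assumption: same rm and medv columns as the CSV)
housing_demo <- MASS::Boston

# Simple linear regression of median value on number of rooms
fit <- lm(medv ~ rm, data = housing_demo)
housing_demo$medv_fit <- fitted(fit)

# Observed points plus the fitted line
p_fit <- plot_ly(housing_demo, x = ~rm) %>%
  add_markers(y = ~medv, name = "Observed",
              marker = list(color = 'rgba(255, 182, 193, .9)')) %>%
  add_lines(y = ~medv_fit, name = "Linear fit") %>%
  layout(title = "Median Home Value vs. Number of Rooms",
         xaxis = list(title = "Number of Rooms"),
         yaxis = list(title = "Median Home Value ($1000s)"))
p_fit

# Pearson correlation between number of rooms and median value
cor(housing_demo$rm, housing_demo$medv)
```

A positive slope and a positive correlation would support the visual impression; neither of them says anything about causation.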
EDA Question. What is the variability and distribution of property tax rates (tax) across different properties?
First the complete code, and then a step-by-step explanation:
plot_ly(housing, x = ~tax, type = "histogram",
marker = list(color = 'rgba(100, 250, 100, 0.7)')) %>%
layout(title = "Histogram of Property Tax Rates",
xaxis = list(title = "Property Tax Rate"),
yaxis = list(title = "Count"))
Initialize the plot:
plot_ly(housing, x = ~tax)
x = ~tax sets the x-axis to represent the property tax rates.
Specify the plot type: define that you want to create a histogram.
type = "histogram"
type = "histogram" tells Plotly to generate a histogram.
Customize appearance: customize the color and style of the histogram bars.
marker = list(color = 'rgba(100, 250, 100, 0.7)')
color = 'rgba(100, 250, 100, 0.7)' sets a semi-transparent green color.
Configure the Layout: Add titles and axis labels to make the plot informative.
layout(title = "Histogram of Property Tax Rates",
xaxis = list(title = "Property Tax Rate"),
yaxis = list(title = "Count"))
title provides a main title.
xaxis and yaxis define the labels for the x-axis and y-axis.
Analysis. The histogram reveals the distribution of property tax rates among properties in Boston. It helps identify the most common tax rates and detect any outliers or unusual spikes in the data.
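The shape of a histogram depends on the binning, so it is worth trying more than one bin count. The sketch below uses the nbinsx trace attribute, which sets an upper bound on the number of bins rather than an exact count. Assumptions: MASS::Boston stands in for housing, the value 40 is an arbitrary choice, and the object names are illustrative.

```r
library(plotly)

# Stand-in for 'housing' (assumption: same tax column as the CSV)
housing_demo <- MASS::Boston

# Histogram of tax with a finer binning than the default
p_hist <- plot_ly(housing_demo, x = ~tax, type = "histogram",
                  nbinsx = 40,
                  marker = list(color = 'rgba(100, 250, 100, 0.7)')) %>%
  layout(title = "Histogram of Property Tax Rates (up to 40 bins)",
         xaxis = list(title = "Property Tax Rate"),
         yaxis = list(title = "Count"))
p_hist

# The most frequent tax rates, as a numeric complement to the plot
tax_counts <- sort(table(housing_demo$tax), decreasing = TRUE)
head(tax_counts, 3)
```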
EDA Question. How does the crime rate (crim) correlate with age, and how are these factors related to whether the property tract bounds the Charles River or not (see the chas variable)?
Here is the complete code:
plot_ly(data = housing,
x = ~age,
y = ~crim,
type = 'scatter',
mode = 'markers',
marker = list(size = ~medv/3,
color = ~chas),
text = ~paste("Crime Rate:", crim,
"<br>Age:", age,
"<br>Median Value:", medv,
"<br>Charles River:", chas)) %>%
layout(title = "Bubble Chart of Crime Rate vs. Age",
xaxis = list(title = "Age"),
yaxis = list(title = "Crime Rate"))
Log-scales for both crim and age might be
of some help here:
plot_ly(data = housing,
x = ~log(age),
y = ~log(crim),
type = 'scatter',
mode = 'markers',
marker = list(size = ~medv/3,
color = ~chas),
text = ~paste("Crime Rate:", crim,
"<br>Age:", age,
"<br>Median Value:", medv,
"<br>Charles River:", chas)) %>%
layout(title = "Bubble Chart of Crime Rate vs. Age",
xaxis = list(title = "log(Age)"),
yaxis = list(title = "log(Crime Rate)"))
Analysis. The highest crime rates are indeed found in the areas with the oldest properties, which also seem to be the cheapest. There is no apparent evidence that bounding the Charles River makes any difference.
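An alternative to transforming the variables is to keep the original units and ask Plotly for logarithmic axes, so that tick labels and hover text stay interpretable. The sketch below also maps chas to a discrete color, which makes the two groups easier to tell apart. Assumptions: MASS::Boston stands in for housing, and the river column, its labels, and the color names are illustrative.

```r
library(plotly)

# Stand-in for 'housing' (assumption: same columns as the CSV)
housing_demo <- MASS::Boston

# chas is coded 0/1; turn it into a labelled factor for a discrete legend
housing_demo$river <- factor(housing_demo$chas,
                             levels = c(0, 1),
                             labels = c("Does not bound river", "Bounds river"))

# Same scatter plot, with log-scaled axes instead of log-transformed data
p_log <- plot_ly(data = housing_demo,
                 x = ~age,
                 y = ~crim,
                 type = 'scatter',
                 mode = 'markers',
                 color = ~river,
                 colors = c("steelblue", "darkorange"),
                 marker = list(opacity = 0.6),
                 text = ~paste("Crime Rate:", crim,
                               "<br>Age:", age,
                               "<br>Median Value:", medv)) %>%
  layout(title = "Crime Rate vs. Age (log axes)",
         xaxis = list(title = "Age", type = "log"),
         yaxis = list(title = "Crime Rate", type = "log"))
p_log
```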
type and mode and their usage in plot_ly()
In Plotly, particularly when using the plot_ly() function in R, the type and mode parameters serve distinct roles in defining how data is visualized:
type parameter
Definition: The type parameter in plot_ly() specifies the type of plot or chart that you want to create. This determines the overall visual representation of the data.
Common Types: These include scatter, bar, box, histogram, heatmap, surface, scatter3d, mesh3d, and others.
Usage Example: If you want to create a line plot, you would set type = 'scatter' and then specify mode = 'lines' (explained below under mode). For a simple bar chart, you would use type = 'bar'.
mode parameter
Definition: The mode parameter is used primarily with scatter plots (including line charts and bubble charts) to define how the data points are connected or represented visually within the given type.
Common Modes: For type = 'scatter', common mode values include:
markers: Displays data points as individual markers (dots).
lines: Connects data points with lines.
text: Displays data points as text labels.
lines+markers: Uses both lines and markers to represent data points.
lines+markers+text: Combines lines, markers, and text to display the data.
Usage Example: To create a plot that shows both the data points and the lines connecting them, you would use plot_ly(type = 'scatter', mode = 'lines+markers').
Example in R
Here’s a simple example demonstrating the use of type
and mode in plot_ly():
# Sample data
data <- data.frame(
x = 1:100,
y = runif(100)
)
# Line plot
plot_ly(data,
x = ~x,
y = ~y,
type = 'scatter',
mode = 'lines') %>%
layout(title = 'Line Plot')
# Scatter plot with markers
plot_ly(data,
x = ~x,
y = ~y,
type = 'scatter',
mode = 'markers') %>%
layout(title = 'Scatter Plot with Markers')
# Scatter plot with markers and lines
plot_ly(data,
x = ~x,
y = ~y,
type = 'scatter',
mode = 'markers+lines') %>%
layout(title = 'Scatter Plot with Markers and Lines')
For a simple bar plot we use type = 'bar'. The mode parameter applies to scatter-type traces only, so it is omitted here:
# Simple bar plot
plot_ly(data,
x = ~x,
y = ~y,
type = 'bar') %>%
layout(title = 'Simple Bar Plot')
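Bar charts are more typical for aggregated values than for 100 random numbers. As a sketch on the housing data, the block below counts the tracts for each value of rad and plots those counts. Assumptions: MASS::Boston stands in for housing, and the names rad_counts, n_tracts, and p_bar are illustrative.

```r
library(plotly)
library(dplyr)

# Stand-in for 'housing' (assumption: same rad column as the CSV)
housing_demo <- MASS::Boston

# Number of tracts for each value of the highway accessibility index
rad_counts <- housing_demo %>%
  count(rad, name = "n_tracts")

# One bar per rad value; factor() keeps the x-axis categorical
p_bar <- plot_ly(rad_counts,
                 x = ~factor(rad),
                 y = ~n_tracts,
                 type = 'bar') %>%
  layout(title = 'Number of Tracts by Highway Accessibility Index',
         xaxis = list(title = 'rad'),
         yaxis = list(title = 'Count'))
p_bar
```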
The t-test is a statistical test that is used to determine whether there is a significant difference between the means of two groups. It is commonly used when you want to compare the average performance, measurements, or outcomes between two groups under different conditions or treatments. The t-test provides a way to check if the differences observed in the data are likely to be genuinely reflecting a difference in the population, or if they could just be due to random variation.
There are several types of t-tests, but the most common are the one-sample t-test, the independent samples t-test, and the paired samples t-test.
For our analysis, we’ll focus on the independent samples t-test, as we are comparing two different subsets of data: tracts with a crime rate above the median and tracts with a crime rate at or below it.
Let’s move through the steps of our analysis:
Find the median of the crim (crime rate) variable.
Split the data into two groups based on whether crim is greater than or less than or equal to its median.
Compare the mean of medv (median value of owner-occupied homes) between the two groups.
First, we will load the data, calculate the necessary statistics, and perform the t-test.
Find the median of crim:
# Calculate the median of the 'crim' variable
median_crim <- median(housing$crim)
print(median_crim)
[1] 0.25651
Use dplyr to create two new data sets that we need:
# Subset the data into two groups
low_crim <- housing %>%
dplyr::filter(crim <= median_crim) %>%
dplyr::select(crim, medv)
high_crim <- housing %>%
dplyr::filter(crim > median_crim) %>%
dplyr::select(crim, medv)
head(low_crim)
head(high_crim)
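Before running the test, it helps to look at the group sizes and descriptive statistics. Below is a sketch of an equivalent single-pipeline approach with group_by() and summarise(). Assumptions: MASS::Boston stands in for housing, and the crime_group column and the *_demo names are introduced here for illustration.

```r
library(dplyr)

# Stand-in for 'housing' (assumption: same crim and medv columns as the CSV)
housing_demo <- MASS::Boston
median_crim_demo <- median(housing_demo$crim)

# Label each tract by crime group, then summarise medv within each group
group_stats <- housing_demo %>%
  mutate(crime_group = if_else(crim <= median_crim_demo, "low", "high")) %>%
  group_by(crime_group) %>%
  summarise(n = n(),
            mean_medv = mean(medv),
            sd_medv = sd(medv))
group_stats
```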
Perform t-test in R using t.test():
# Perform an independent t-test on 'medv'
t_test_results <- t.test(x = low_crim$medv,
y = high_crim$medv)
# Print the results
print(t_test_results)
Welch Two Sample t-test
data: low_crim$medv and high_crim$medv
t = 6.1202, df = 452.59, p-value = 2.027e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.281241 6.385162
sample estimates:
mean of x mean of y
24.94941 20.11621
This code will output the results of the t-test, which includes the t-statistic, degrees of freedom, p-value, and confidence interval of the difference in means.
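The object returned by t.test() is a list of class htest, so the individual results can be pulled out by name for reporting. A sketch, with MASS::Boston as a stand-in for housing; the names low_medv, high_medv, and res_demo are illustrative.

```r
# Stand-in for 'housing' (assumption: same crim and medv columns as the CSV)
housing_demo <- MASS::Boston
median_crim_demo <- median(housing_demo$crim)

# medv in the low-crime and high-crime groups
low_medv  <- housing_demo$medv[housing_demo$crim <= median_crim_demo]
high_medv <- housing_demo$medv[housing_demo$crim >  median_crim_demo]

res_demo <- t.test(x = low_medv, y = high_medv)

res_demo$statistic  # t-statistic
res_demo$parameter  # degrees of freedom (Welch-adjusted)
res_demo$p.value    # p-value
res_demo$conf.int   # confidence interval for the difference in means
res_demo$estimate   # the two group means
```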
Interpretation:
T-statistic: This value expresses the difference in means between the two groups (low crime rate and high crime rate areas) relative to the variability of their scores. The larger the absolute value of the t-statistic, the stronger the evidence for a difference between groups.
P-value: The p-value is extremely small, far less
than 0.05, which is a common threshold for statistical significance in
social sciences. This suggests that the differences in median home
values between areas with crime rates higher than the median and those
with crime rates lower or equal to the median are statistically
significant.
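To see what the t-statistic measures, it can be recomputed by hand. By default, t.test() does not assume equal variances and runs Welch's test, as the header of the output shows. The sketch below applies the Welch formulas to the same median split and checks them against t.test(). Assumptions: MASS::Boston stands in for housing, and all object names are illustrative.

```r
# Stand-in for 'housing' (assumption: same crim and medv columns as the CSV)
housing_demo <- MASS::Boston
median_crim_demo <- median(housing_demo$crim)

x <- housing_demo$medv[housing_demo$crim <= median_crim_demo]  # low crime
y <- housing_demo$medv[housing_demo$crim >  median_crim_demo]  # high crime

# Squared standard error contributed by each group
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t-statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Compare with the values reported by t.test()
res_check <- t.test(x = x, y = y)
all.equal(t_manual, unname(res_check$statistic))
all.equal(df_manual, unname(res_check$parameter))
```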
Conclusion:
Based on the t-test, we can conclude that there is a statistically significant difference in the median values of homes between areas with relatively higher crime rates compared to those with lower crime rates. This implies that, on average, areas with lower crime rates tend to have higher median home values. This result is quite intuitive as safer neighborhoods are generally more desirable and can drive higher property values. To illustrate:
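One way to visualize the difference is a pair of box plots of medv by crime group. This is a sketch rather than the notebook's original figure: MASS::Boston stands in for housing, and the crime_group column, its labels, and the object names are illustrative.

```r
library(plotly)

# Stand-in for 'housing' (assumption: same crim and medv columns as the CSV)
housing_demo <- MASS::Boston
median_crim_demo <- median(housing_demo$crim)

# Label each tract by its position relative to the median crime rate
housing_demo$crime_group <- ifelse(housing_demo$crim <= median_crim_demo,
                                   "Low crime (<= median)",
                                   "High crime (> median)")

# One box per group
p_box <- plot_ly(housing_demo,
                 x = ~crime_group,
                 y = ~medv,
                 type = 'box') %>%
  layout(title = "Median Home Value by Crime Rate Group",
         xaxis = list(title = "Crime Rate Group"),
         yaxis = list(title = "Median Home Value ($1000s)"))
p_box
```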
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2024.
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.